arXiv:1507.04457vl [stat.ML] 16Jul2015 


Preference Completion: Large-scale Collaborative Ranking from 

Pairwise Comparisons 


Dohyung Park 

The University of Texas at Austin 
dhpark@utexas.edu 


Joe Neeman 

The University of Texas at Austin 
j oeneenian@gniail. com 


Jin Zhang 

The University of Texas at Austin 
zj@utexas.edu 


Sujay Sanghavi 

The University of Texas at Austin 
sanghavi@mail.utexas.edu 


Inderjit S. Dhillon 
The University of Texas at Austin 
inderjit@cs.utexas.edu 


July 17, 2015 


Abstract 

In this paper we consider the collaborative ranking setting: a pool of users each provides a 
small number of pairwise preferences between d possible items; from these we need to predict 
each users preferences for items they have not yet seen. We do so by fitting a rank r score 
matrix to the pairwise data, and provide two main contributions: 

(a) we show that an algorithm based on convex optimization provides good generalization guar¬ 
antees once each user provides as few as 0(r log^ d) pairwise comparisons - essentially matching 
the sample complexity required in the related matrix completion setting (which uses actual nu¬ 
merical as opposed to pairwise information), and 

(b) we develop a large-scale non-convex implementation, which we call AltSVM, that trains a 
factored form of the matrix via alternating minimization (which we show reduces to alternating 
SVM problems), and scales and parallelizes very well to large problem settings. It also outper¬ 
forms common baselines on many moderately large popular collaborative filtering datasets in 
both NDCG and in other measures of ranking performance. 


1 Introduction 

This paper considers the following recommendation system problem: given a set of items, a set 
of users, and non-numerical pairwise comparison data, find the underlying preference ordering of 
the users. In particular, we are interested in the setting where data is of the form “user i preferes 
item j over item k", for different ordered user-item-item triples i,j,k. Pairwise preference data is 
wide-spread; indeed, almost any setting where a user is presented with a menu of options - and 
chooses one of them - can be considered to be providing a pairwise preference between the chosen 
item and every other item that is presented. 

Crucially, we are interested in the collaborative filtering setting, where (a) on the one hand 
the number of such pairwise preferences we have for any one user is woefully insufficient to infer 
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anything for that user in isolation; and (b) on the other hand, we aim for personalization, i.e. for 
every user to possibly have different inferred preferences from every other. To reconcile these two 
requirements, our method relates the preferences of users to each other via a low-rank matrix, 
which we (implicitly) assume governs the observed preferences. Essentially, we fit a low-rank users 
X items score matrix X to pairwise comparison data by trying to ensure that Xij — Xik is positive 
when user i prefers item j to item k. 

Our contributions: We present two algorithms to infer the score matrix X from training data; 
once inferred, this can be used for predicting future preferences. While there has been some recent 
work on fitting low-rank score matrices to pairwise preference data (which we review and compare 
to below), in this paper we present the following two contributions: 

(a) A statistical analysis for the convex relaxation: we bound the generalization error of the solution 
to our convex program. Essentially, we show that the minimizer of the empirical loss also almost 
minimizes the true expected loss. We also give a lower bound showing that our error rate is sharp 
up to logarithmic factors. 

(b) ^ large-scale non-convex implementation: We provide a non-convex algorithm that we call 
Alternating Support Vector Machine (AltSVM). This non-convex algorithm is more practical than 
the convex program in a large-scale setting; it explicitly parameterizes the low-rank matrix in 
factored form and minimizes the hinge loss. Crucially, each step in this algorithm can be formulated 
as a standard SVM that updates one of the two factors; the algorithm proceeds by alternating 
updates to both factors. We apply a stochastic version of dual coordinate descent [3 [22] with lock- 
free parallelization. This exploits the problem structure and ensures it parallelizes well. We show 
that our algorithm outperforms several existing collaborative ranking algorithms in both speed and 
prediction accuracy, and it achieves significant speedups as the number of cores increases. 

1.1 Related Work 

Ranking/learning preferences is a classical problem that has been considered in a large amount of 
work. There are many different settings for this problem, which we discuss below. 

Learning to Rank The main problem in this community has been to estimate a ranking function 
from given feature vectors and relevance scores. Depending on its application, a feature vector may 
correpond to a user-item pair or a single item. While there have been algorithms that use pairwise 
comparisons mm of the training samples, our setting is different in that our data consists only 
of pairwise comparisons. We refer the reader to the snrvey [El- 

One ranking with pairwise comparisions In a single-nser model, we are asked to learn a single 
ranking given pairwise comparisons. Jamieson & Nowak m and Ailon [T] consider an active query 
model with noiseless responses; Jamieson & Nowak m give an algorithm for exactly recovering the 
true ranking under a low-rank assumption similar to ours, while Ailon [T] approximately recovers 
the true ranking without such an assumption. Wanthier et al. |27| and Negahban et al. [I8| learn 
a ranking from noisy pairwise comparisions; Negahban et al. m consider a Bradley-Terry-Luce 
model similar to ours and attempt to learn an underlying score vector, while Wauthier et al. |27) 
get by withont strncture assnmptions, but only attempt to learn the ranking itself. Hajek et al. [5| 
considered a problem to learn a single ranking given a more generalized partial rankings from the 
Plackett-Luce model and provided a minimax-optimal algorithm. 
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Many rankings with pairwise comparisions Given multiple users with different rankings, 
one could of course attempt to learn their rankings by simply applying an algorithm from the 
previous section to each user individually. However, it is more efficient - both statistically and 
computationally - to postulate some global structure and use it to relate the many users’ rankings. 
This is the same idea that has been applied so successfully in collaborative filtering. Rendle et al. 
[2U| and Liu et al. m were the first to take this approach. They modeled the observations as coming 
from a BTL model with low-rank structure (i.e., very similar to our model) and gave algorithms for 
learning the model parameters. Yi et al. m took a purely optimization-based approach. Rather 
than assuming a probabilistic model, they minimized a convex objective using the hinge loss on a 
low-rank matrix. In a slightly different model, Hu et al. [9] and Shi et al. [23] consider the problem 
of learning from latent feedback. Recently, Lu & Negahban [T6| analyzed an algorithm which is 
very similar to ours for the Bradley-Terry-Luce model independently from our work. 

Many rankings with 1-bit ratings Instead of moving to pairwise comparisons, some work has 
suggested avoiding the difficulties of numerical ratings by instead asking users to give 1-bit ratings 
to items; that is, each user only indicates whether they like or dislike an item. In this setting, the 
work of Davenport et al. |1| is most closely related to ours, in that they assume an underlying low- 
rank structure and give an algorithm based on convex optimization. Also, our theoretical analysis 
owes a lot to their work. Xu et al. m consider a slightly different goal: rather than attempting 
to recover the preferences of each user, they try to cluster similar users and similar items together. 
Yun et al. |32| proposed an optimization problem motivated from robust binary classification and 
used stochastic gradient descent to solve the problem in a large-scale setting. 

Many rankings with numerical ratings The goal in this setting is the same as ours, except 
that the data is in the form of numerical ratings instead of pairwise comparisons. Weimer et al. [28| 
attempted to directly optimize Normalized Discounted Cumulative Gain (NDCG), a widely used 
performance measure for ranking problems. Balakrishnan &: Chopra |2|, and Volkovs & Zemel |26j 
converted this problem into a learning-to-rank problem and solved it using the existing algorithms. 
While these works considered the low-rank matrix model, different models are proposed by Weston 
et al. [29] and Lee et al. m- Weston et al. |29| proposed a tensor model to rank items for different 
queries and users, and m proposed a weighted sum of low-rank matrix models. 

2 Empirical Risk Minimization (ERM) 

Let us first formulate the problem mathematically. The task is to estimate rankings of multiple 
users on multiple items. We denote the numbers of users by di, and the number of items by d 2 . 
We are given a set of triples H C [di] x [^ 2 ] x [^ 2 ], where the preference of user i between items 
j and k is observed if (i,j,k) G H. The observed comparison is then given by G {1,-1} : 

{i,j,k) G H} where Vjfc = 1 if user i prefers item j over item k, and Yijk = — 1 otherwise. Let 

= {(y k) : {i,j, k) G D} denote the set of item pairs that user i has compared. 

We predict rankings for multiple users by estimating a score matrix X G that 

Xij > Xik means that user i prefers item j over item k. Then the sorting order for each row 
provides the predicted ranking for the corresponding user. 

We propose (as have others) that X is low-rank or close to low-rank, the intuition being that 
each user bases their preferences on a small set of features that are common among all the items. 
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Then the empirical risk minimization (ERM) framework can naturally be formulated as 

minmize ^ C{Yijk{Xij - Xik)) (1) 

subject to rank(X) < r 

where £(•) is a monotonically non-increasing loss function which induces Xij > Xik if = 1) and 
Xij < Xik otherwise, (e.g., hinge loss, logistic regression loss, etc.) 

Solving Q is NP-hard because of the rank constraint. As a first alternative, we propose a 
straightforward convex relaxation. 


3 Convex Relaxation 

of Q, which involves a nuclear norm constraint. 

^ £{Yijk{Xij - X,k)) ( 2 ) 

{i,j,k)£Q 

||A||* < \d1d2 

Here, for any matrix A, the nuclear/trace norm ||A||* denotes the sum of its singular values; it 
is a well-recognized convex surrogate for low-rank structure (most famously in matrix completion). 

The only parameter of this algorithm is A, which governs the trade-off between better optimizing 
the likelihood of the observed data, and the strictness in imposing approximate low-rank structure. 
Since we motivated our algorithm with the assumption that X has low rank, we should point out 
how our algorithm’s parameter A compares to the rank: note that if A is a di x 6,2 rank-r matrix 
whose largest absolute entry is bounded by C then ||A||* < y^HAHi? < C\/rdid 2 - In other words, 
A is a parameter that takes into account both the rank of A and the size of its elements, and it is 
roughly proportional to the rank. 


Our first method is the convex relaxation 

minimize 

subject to 


3.1 Analytic results 

We analyze (§ by assuming a standard model for pairwise comparisons. Then we provide a 
statistical guarantee of the method under the model. 

Recall the classical Bradley-Terry-Luce model mm for pairwise preferences of a single user, 
which assumes that the probability of item j being preferred over k is given by a logistic of the 
difference of the underlying preference scores of the two items. For multiple users, we assume that 
there is some true score matrix A* G 

i + exp(Y--vy 

Assume that each user-item-item triple (i,j,k) independently belongs to with probability 
Pij^k^ and let m = Yli j kPi,j,ic expected size of Cl. We will assume that the Pij^k are 

approximately balanced in the sense that no user-item pair is observed too frequently: 



Assumption 3.1. 


There is a constant k > 0 such that for every i,j, 

m 




did2 
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Note that if k = 1 in Assumption 3.1 then the pi j k are all equal, meaning that each user-item- 


item triple has an equal chance to be observed. 

In order to state our error bounds, we first introduce some notation: let Px be the distribution 
of ^ < i < di,l < j < k < d 2 } (i.e. the complete distribution of all pairwise preferences, 

even those that are not observed). 

Our main upper bound shows that if m is sufficiently large then our algorithm finds a solution 
with almost minimal risk. Given a loss function £, define the expected risk of X by 


1 


di d2 


R{X) = - X,k)), 


2=1 j,k=l 

where the expectation is with respect to the distribution parametrized by the true parameters X*. 
Theorem 3.1. Suppose that C is 1-Lipschitz, and let Y and 11 be distributed as ¥x* for some 


di X d 2 matrix X*. Under Assumption 3.1 


ER{X) < inf ER{X) + CkJ ^ log(di + da), 

{X:||Js:||.<7131^1 V m 

where C is a universal constant. 

We recall that the parameter A is related to rank in that if A is a di x da rank-r matrix whose 
largest absolute entry is bounded by C then ||A||* < y^||A||ir < Cy/rdid 2 . In other words, A is 
a parameter that takes into account both the rank of X* and the size of its elements, and it is 
roughly proportional to the rank. In particular. Theorem 3.1 shows that once we observe m ~ r(di-|- 
da) log^(di -|- da) pairwise comparisons, then we can accurately estimate the probability of any user 
preferring any item over any other. In other words, we need to observe about r(l -|-da/di) log^(di -|- 
da) comparisons per user, which is substantially less than the rda log(da) comparisons that we would 
have required if each user were modelled in isolation. Moreover, our lower bound (below) shows 
that at least r(l -|-da/di) comparisons per user are required, which is only a logarithmic factor from 
the upper bound. 

Theorem 3.2. Suppose that C'{D) < 0. Let A be any algorithm that receives {Yij^k '■ G 

as input and produces X as output. For any A > 1 and m > di -|- da, there exists X* with 
II A* II* < \/Adida such that when Y and Ll are distributed according to Px* then with probability at 
least 


ER{X) > R{X*) -1 cmin M, 


where c> 0 is a constant depending only on C. 


A(di -|- da) 


m 


Together, Theorems 3.1 and 3.2 show that (up to logarithmic factors) if X* has rank r then about 


r(l -|- da/di) comparisons per user are necessary and sufficient for learning the users’ preferences. 


3.1.1 Maximum likelihood estimation of A* 

By specializing the loss function C, Theorem |3.1| has a simple corollary for maximum-likelihood 
estimation of A*. Recall that if p and v are two probability distributions on a finite set S the the 
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Kullback-Leibler divergence between them is 




seS 


under the convention that OlogO = 0. We recall that although -D(-||-) is not a metric it is always 
non-negative, and that D(fi\\iy) = 0 implies ^ 

Corollary 3.3. Let Y and Ll be distributed as Pjv* for some di x d 2 matrix X*. Define the loss 


function C by C{z) = log(l -|- exp(z)) — z. Under Assumption 3.1 
1 


sup 


didl {x-.\\x\\^<YLd^} 

where C is a universal constant. 


Difx* 


■ X 


-D{¥x*Wx)<Ck 


X{di Y dfi) 


m 


log{di + d 2 ), 


Note that the loss functio n in Corollary 3.3 is exactly the negative logarithm of the logi stic 
function, and so X in Corollary |3.3| is the maximum-likelihood estimate for X*. Thus, Corollary |3.3| 
shows that the distribution induced by the maximum-likelihood estimator is close to the true 
distribution in Kullback-Leibler divergence. 


4 Large-scale Non-convex Implementation 

While the convex relaxation is statistically near optimal, it is not ideal for large-scale datasets 
because it requires the solution of a convex program with di x d 2 variables. In this section we 
develop a non-convex variant which both scales and parallelizes very well, and has better empirical 
performance as compared to several existing empirical baseline methods. 

Our approach is based on the following steps: 

1. We represent the low-rank matrix in explicit factored form X = UV~^ and replace the regu- 

larizer appropriately. This results in a non-convex optimization problem in U £ and 

V £ where r is the rank parameter. 

2. We solve the non-convex problem by alternating between updating U while keeping V fixed, 
and vice versa. With the hinge loss (which we found works best in experiments), each of 
these becomes an SVM problem - hence we call our algorithm AltSVM. 

3. The problem is of course not symmetric in U and V because users rank items but not vice 
versa. For the U update, each user vector naturally decouples and can be done in parallel 
(and in fact just reduces to the case of rankSVM [E]). 

4. For the V update, we show that this can also be made into an SVM problem; however it 
involves coupling of all item vectors, and all user ratings. We employ several tricks (detailed 
below) to speed up and effectively parallelize this step. 

The non-convex problem can be written as 

d:{Yijk-uJ{vj -Vk)) + ^{\\U\\l + \\V\\l) (3) 
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where we replace the nuclear norm regularizer using the property ||X||* = mhix=uv^ Idl^IlF + 
ll^lll’) l2l]- ul and vj denote the ith rows of U and V, respectively. While this is a non-convex 
algorithm for which it is hard to find the global optimum, it is computationally more efficient since 
only (di +d 2 )r variables are involved. We propose to use L2 hinge loss, i.e., £(x) = max(0,1 — x)^. 

In the alternating minimization of Q, the subproblem for U is to solve 

U ^ arg mm ^ £(Yijk ■ uj{vj - vj)} + ^ \\U\\l, (4) 


while V is fixed. This can be decomposed into n independent problems for Ui’s where each solves 
for 


arg mm — 
° ueR’’ 2 


u 


h + 


(j,fc)en, 


^ Vk). 


(5) 


This part is in general a small-scale problem as the dimension is r, and the sample size is |nj| for 
each user i. 

On the other hand, solving for V with fixed U can be written as 


V arg min 


A 

2 


11^111 + 


{i,j,k)en 


( 6 ) 


where g jg gf j^ii,j,k) jg . yT [f I — —Yijk ■ uJ if I = k, and 

0 otherwise. It is a much larger SVM problem than Q as the dimension is d 2 r and the sample size 
is |n|. 

We note that the feature matrices k) G 0} are highly sparse since in each feature 

matrix only 2r out of the d 2 r elements are nonzero. This motivates us to apply the stochastic 
dual coordinate descent algorithm 13 E2], which not only converges fast but also takes advantages 
of feature sparsity in linear SVMs. Each coordinate descent step takes 0{r) computation, and 
iterations over |n| coordinates provide linear convergence |22| . 

Now we describe the dual problems of our two subproblems explicitly. Let a G denote the 
dual vector for in which each coordinate is denoted by aijk where (j, fc) G 12*. Then the dual 
problem of ([^ is to solve 


1 

min - 

Q:gRTil,Q:>o2 


^ ^ C^ijk^ijkixj Vk) 
(j,k)£Qi 


U,k)£Qi 


(7) 


where C*{z) is the convex conjugate of £. At each coordinate descent step for Oijk, we find 
the value of a^jk minimizing 0 while all the other variables are fixed. If we maintain Ui = 
Yl(jk)eni OiijkYijk{vj — Xk)i then the coordinate descent step is simply to find 6 * minimizing 


1 

2 


Ui Y d Yij}^{vj 


Xk)\^ + {—\{aijk + 5*)) 


and update aijk ^ ctijk + <5*- 


( 8 ) 
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Algorithm 1 Alternating Support Vector Machine (AltSVM) 


Require: O, {lijfc : {i,j,k) G O}, and A G M'*' 
Ensure: U G V G 

1: Initialize U, and set a, j3 <— 0 G 
2: while not converged do 
3 : Vj G- Yl(iJ,k)&n 

~ '^2(i,k,j)eQ l^ikj'^ikjUi, Vj G [^ 2 ] 


4 

5 

6 

7 

8 
9 

10 

11 

12 

13 

14 

15 

16 

17 

18 

19 

20 
21 
22 


for all threads t = 1,..., T in parallel do 
for s = 1,..., S' do 

Choose {i,j, k) G Q uniformly at random 
Find (5* minimizing (10). 


Pijk ^ l^ijk “f ^ 

+ ^*^ijkUi 


end for 
end for 


Ui 'l2{iJ,k)&Q ^ijk^ijki'^j '^k)i Vi G [dl] 

for all threads t = 1,..., T in parallel do 
for s = 1,..., S do 

Choose (i, j, /c) G fl uniformly at random. 
Find S* minimizing ([^. 
oiijk ^ Oiijk + 6* 

Hi i Ui ^ '^ijkiyj I’/c) 

end for 
end for 
end while 


The dual problem of ([^ is to solve 


1 

min - 

/3eiRl^l,/3>o2 




2 

p (i,j,fc)eo 


(9) 


where /3 is the dual vector for the subproblem ([^. Similarly to aijk, the coordinate descent step 
for I3ijk is to replace (3ijk by /3ijk + S* where S* minimizes 


1 

2 


+ 5*YijkUi\\l 




+ C* { — X{/3ijk + d*)), 


( 10 ) 


and maintain V = 'E{i,j,k)&fi 

The detailed description of AltSVM is presented in Algorithm In each subproblem, we 
run the stochastic dual coordinate descent, in which a pairwise comparison {i,j,k) G 11 is chosen 
uniformly at random, and the dual coordinate descent for aijk or /3ijk is computed. We note that 
each coordinate descent step takes the same 0 {r) computational cost in both subproblems, while 
the subproblem sizes are much different. 












4.1 Parallelization 


For each subproblem, we parallelize the stochastic dual coordinate descent algorithm asynchronously 
without locking. Given T processors, each processor randomly sample a triple {i,j, k) € and up¬ 
date the corresponding dual variable and the user or item vectors. We note that this update is for a 
sparse subset of the parameters. In the user part, a coordinate descent step for one sample updates 
only r out of the rdi variables. In the item part, one coordinate descent step for a sample update 
only 2r out of the rd 2 variables. This motivates us not to lock the variables when updated, so 
that we ignore the conflicts. This lock-free parallelism is shown to be effective in |19) for stochastic 
gradient descent (SGD) on the sum of sparse functions. Moreover, in [8], it is also shown that the 
stochastic dual coordinate descent scales well without locking. We implemented the algorithm using 
the OpenMP framework. In our implementations, we also parallelized steps 3 and 13 of Algorithm 
We show in the next section that our proposed algorithm scales up favorably. 

4.2 Remark on the implementation 

In Algorithm the subproblem for V comes first, and then it solves for the user vectors U. We 
empirically observed that this order gives better convergence on practical datasets. We also note 
that each subproblem reuses the dual variables in the previous outer iteration. When almost 
converged, the features {V for solving U, and U for solving V) do not change too much. By 
reusing the dual variables in the previous iteration we can start with a feasible solution close to the 
optimum. 


5 Experimental results 

5.1 Pairwise data 


We used the MovieLens 100k dataset, which contains 100,000 ratings given by 943 users on 1682 
movies. The ratings are given as integers from one to five, but we converted them into preference 
data by declaring that a user preferred one movie to another if they gave it a higher rating (if two 
movies received the same rating, we treated it as though the user did not provide a preference). 
Then we held out 20% of the data as a test set. 

We compared our algorithm to the following two: 

• Bayesian Personalized Ranking (BPR) [20]; This algorithm is based on a similar model 
to ours, but a different optimization procedure (essentially, a variant of stochastic gradient 
descent). 


• Matrix completion from pairwise differences : A standard matrix completion algorithm that 
observes - for various triples {i,j, k) € Q - the difference between user z’s ratings for item j 
and item k. Note that this algorithm has an advantage over Q because it sees the magnitude 
of this difference instead of only its sign. Nevertheless, the matrix completion algorithm does 
not perform any better than Q. A similar phenomenon was also observed in |3|. 

We evaluate our performance by computing the proportion of pairwise comparisons in the test 
set T for which we correctly infer the user’s preference. 


(Prediction error) 


i(w, > Wfc) 

{iJ,k)£T,Y^jk = l 
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Figure 1: Prediction accuracy on the MovieLens 100k dataset, for different numbers of observed 
comparisions per user. For the “restricted” plots, only the pairs with a rating difference of two or 
more were used for evaluation. 


This is similar to the AUC statistic measured by Rendle et al. [20], and if the data were fully 
observed then it would measure Kendall’s distance between each user’s true preferences and the 
learned ones. However, our main reason for choosing this measure of performance is that, as an 
average accuracy over all pairwise comparisions, it resembles the quantity that we study in our 
theoretical bounds. 

Unsurprisingly, we were more accurate at correctly inferring strong preferences; therefore, we 
have also shown the accuracy obtained by only measuring performance on pairs whose rankings 
differ by two or more. Both the methods we considered do measurably better at predicting these 
orderings. 

5.2 Large-scale experiments on rating data 

Now we demonstrate that our algorithm performs well as a collaborative ranking method on rating 
data. We used the datasets specified in Table Given a training set of ratings for each user, our 
algorithm will only use non-tying pairwise comparisons from the set, while other competing algo¬ 
rithms use the ratings themselves. Hence, they have more information than ours. The competing 
algorithms are those with publicly available codes provided by the authors. 

• CofiRank (28Q This algorithm uses alternating minimization to directly optimize NDCG. 

Tttp://www.cofirank.org, The dimension and the regularization parameter are set as suggested in the paper. 
For the rest of the parameters, we left them as provided. 
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MovieLenslm 

MovieLenslOm 

Netflix 

Users 

6,040 

71,567 

480,000 

Items 

3,900 

10,681 

17,000 

Ratings 

1,000,209 

10,000,054 

100,000,000 


Table 1: Datasets to be used for simulation 


• Local Collaborative Ranking (LCR) [ISQ : The main idea is to predict preferences from the 
weighted sum of multiple low-rank matrices model. 

• RobiRank IS20: This algorithm uses stochastic gradient descent to optimize the loss function 
motivated from robust binary classification. 

• Global Ranking : To see the effect of personalized ranking, we compare the results with a 
global ranking of the items. We fixed U to all ones and solved for V. 


The algorithms are compared in terms of two standard performance measures of ranking, which 
are NDCG and Precision@iL. NDCGORT is the ranking measure for numerical ratings. NDCG@iL 
for user i is dehned as 


where 


NDCG@R:(i) 


DCG@R:(i,7ri) 

DCG@R:(i,7r*) 


DCG@K(i,7ri) 


2^iTTi(k) _ ]_ 

^ \og2{k + l)' 


and TTuik) is the index of the kth ranked item of 71 in our prediction. Mij is the true rating of item 
j by user i in the given dataset, and vr* is the permutation that maximizes DCG@iL. This measure 
counts only the top K items in our predicted ranking and put more weights on the prediction of 
highly ranked items. We measured NDCG@10 in our experiments. Precision@iL is the ranking 
measure for binary ratings. Precision@iL for user i is defined as 

Precision@Rr(i) = ^ ^ Mij 

j&VKii) 


where Mij is the binary rating on item j by user i given in the dataset. This counts the number of 
relevant items in the predicted top K recommendation. These two measures are averaged over all 
of the users. 

We first compare our algorithm with numerical rating based algorithms, GofiRank and LGR. 
We follow the standard setting that are used in the collaborative ranking literature [281 m [261 IS]. 

^http://prea.gatech.edu, We run the code with each of the 48 sets of loss function and parameters given in 
the main code, and the best result is reported. We could not run this algorithm on the Netflix dataset due to time 
constraint. 

“https: //bitbucket. org/d_ijk_stra/robirank, We used the part for collaborative ranking from binary relevence 
score. We left the parameter settings as provide with the implementation. 
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(a) 



Time (seconds) 



Time (seconds) 


(b) 


(c) 


Figure 2: NDCG@10 and Precision@10 over time for different algorithms. 


Datasets 

N 

AltSVM 

AltSVM-sub 

AltSVM-logistic 

Global 

CofiRank 

LCR 


20 

0.7308 

0.6998 

0.7125 

0.7500 

0.7333 

0.7007 

MovieLenslm 

50 

0.7712 

0.7392 

0.7141 

0.7501 

0.7441 

0.7081 


100 

0.7902 

0.7508 

0.7446 

0.7482 

0.7332 

0.7151 


20 

0.7059 

0.7053 

0.7031 

0.7264 

0.7076 

0.6977 

MovieLenslOm 

50 

0.7508 

0.7212 

0.7115 

0.7176 

0.6977 

0.6940 


100 

0.7692 

0.7248 

0.7292 

0.7101 

0.6754 

0.6899 


20 

0.7132 

0.6822 

- 

0.7605 

0.6615 

- 

Netflix 

50 

0.7642 

0.7111 

- 

0.7640 

0.6527 

- 


100 

0.8007 

0.7393 

- 

0.7656 

0.6385 

- 


Table 2: NDCG@10 on different datasets, for different numbers of observed ratings per user. 


For each user, we subsampled N ratings, used them for training, and took the rest of the ratings for 
test. The users with less than + 10 ratings were dropped out. Table compares AltSVM with 
numerical rating based algorithms. While = 20 is too small so that a global ranking provides 
the best NDCG, our algorithm performs the best with larger N. We also ran our algorithm with 
subsampled pairwise comparions with the largest numerical gap (AltSVM-sub), which are as many 
as N for each user (the number of numerical ratings used in the other algorithms). Even with this, 
we could achieve better NDGG. We can also observe that the statistical performance is better with 
the hinge loss than with the logistic loss. 

We have also experimented with collaborative ranking on binary ratings. We compare our 
algorithm against RobiRank [32], which is a recently proposed algorithm for collaborative ranking 
with binary ratings. We ran an experiment on a binarized version of the Movielenslm dataset. In 
this case, the movies rated by a user is assumed to be relevant to the user, and the other items are 
not. Since it is inefficient to take all possible comparisons which are in average a half million per 
user, we subsampled C comparisons for each user. Both algorithms are set to estimate rank-100 
matrices. Table shows that our algorithm provides better performance than RobiRank. 

5.3 Computational speed and Scalability 

We now show the computational speed and scalability of our practical algorithm, AltSVM. The 
experiments were run on a single 16-core machine in the Stampede Gluster at University of Texas. 

Figures]^ and[^ show NDGG@10 over time of our algorithms with 1, 4, and 16 threads, com- 
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Precision® 

C = 1000 

AltSVM 

C = 2000 

C = 5000 

RobiRank 

1 

0.2165 

0.2973 

0.3635 

0.3009 

2 

0.1965 

0.2657 

0.3297 

0.2695 

5 

0.1572 

0.2097 

0.2697 

0.2300 

10 

0.1265 

0.1709 

0.2223 

0.1922 

100 

0.0526 

0.0678 

0.0819 

0.0781 


Table 3: Precision@iir on the binarized MovieLenslm dataset. 


# cores 

1 

2 

4 

8 

16 

Time (seconds) 

963.1 

691.8 

365.1 

188.3 

111.0 

Speedup 

lx 

1.4x 

2.6x 

5.lx 

8.7x 


Table 4: Scalability of AltSVM on the binarized MovieLenslm dataset. 


pared to CofiRank. Figure shows Precision@10 over time of our algorithm with C = 5000. We 
note that our algorithm converges faster, while the sample size |fl| for our algorithm is larger than 
the number of training ratings that are used in the competing algorithms. Table shows the scala¬ 
bility of AltSVM. We measured the time to achieve 10“^ tolerance on the binarized MovieLenslm 
dataset. As can be seen in the table, we could achieve significant speedup. 

6 Conclusion 

We considered the collaborative ranking problem where one fits a low-rank matrix to the pairwise 
comparisons by multiple users. We showed that the convex relaxation of the empirical risk min¬ 
imization provides good generalization guarantees. For the large-scale practical settings, we also 
proposed a non-convex algorithm, which alternately solves two SVM problems. Our algorithm was 
shown to outperform the existing ones and parallelizes well. 
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A Proof of Theorem 3.1 

We write L{X) for the function being optimized; i.e., 

L{x)= ^yi,j,k{Xi,j-Xi^k))- 
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Note that for any fixed X, ¥’x*L{X) = mR{X) (where Px* denotes the expectation taken with 
respect to future sampies from Px*; as distinct from E which denotes the expectation over the 
sampies used to generate X). Let K be the set of di x d 2 matrices with nuciear norm at most 1. 
The proof of Theorem 3.1 proceeds in three main steps. 

1. By some aigebraic of manipuiations L, we reduce the probiem to showing a uniform iaw of 
iarge numbers for the famiiy of functions {L{X) : X G ^/Xd^d^K}. 

2. Using symmetrization and duaiity properties of iL, we reduce the probiem to bounding the 
norm of a matrix M whose entries are sums of random signs. 

3. We bound the norm of M using various concentration inequaiities and a theorem of Seginer |21) . 
Since X, by dehnition, minimizes L{X), for any X G y/Xd^d^K we can bound 

Px* [L{X) - L{X)] < Px. [L{X)] - L{X) - (Px* [L{X)] - L(X)) 

<2 sup \¥x*L{X)-L{X)\. 

X<^y/\d\d2k 

In other words, it suffices to show a uniform iaw of iarge numbers for {L{X) : X G y/XdidXzK}. 

Let €ij^k be i.i.d. ±l-vaiued variabies and iet be the indicator that {i,j,k) G Cl. By 
Gine-Zinn’s symmetrization (as in 0), 

sup \Fx*L{X) - L{X)\ 

X£y/X3dd2K 

^ 2E sup ^ ^ (^i,j,kk^{Xi jk{Xi j Xi k)) ■ 

XGy/XdicRX ij^kG^ 

Since C is 1-Lipschitz, we obtain 


sup |Px*[L(X)] — L{X)\ < 2E sup 

Xe^Xdid2K X£^\did2K 


= 2E sup 

X£^/\did2K 




Ci,j,k(^i,j,kiXij Xi^k) 

i,j,k 


where in the iast iine, we recognized that ei^j^kYij^k bas the same distribution as Now, iet M 

denote the matrix where Mij = ~ ^i,k,j^i,k,j)- Then 


'^^i,j,k€i,j,kiXij - Xi^k) = tr(M^X) 

i,j,k 

and so 

sup '^Ci,j,kei,j,k{Xij - Xi^k) = sup tr(M^X) = y/xX^\\M\\. 

X^\/Xdid2X i j XXdid2K 

Putting everything together, we have (for any X G y/Xd^d^K) 


E 


Px*[L(X)]-Px*[L(X)]l <4v^A^E||M||. 
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Together with the following lemma (which we prove in Appendix , this completes the proof of 
Theorem 3.1 

Lemma A.l. With p = 


E||M|| < Cfi;A/p(d7+~^log(di(i2)- 


B 


Proof of Lemma 


KH 


We will decompose M into two parts, M = with 

^ij ^ ~ ’’Yh 

k¥=j 

Then ||M|| < ||M(^)|| + ||M(^)||. Since and have the same distribution, 

E||M|| < 2E||M(^)||, 

and so we are reduced to studying which has i.i.d. entries. Now, we apply Seginer’s theo¬ 

rem |21j : 

E||M(i)|| < C (^Emax||M^^^^||2 +Emax||Mi)^||2^ , (11) 

where denotes the ith row of and denotes the jth column, and || • II 2 denotes the 
Euclidean norm. 

We will separate the task of bounding E max^ ||2 into two parts: if ||x||o denotes the 

number of non-zero coordinates in x and ||x||oo denotes maxj \xj\ then ||x ||2 < y^||x||o||x||oo; with 
the Cauchy-Schwarz inequality, this implies that 


E 


max 11I 


< E 




( 12 ) 


First, we will show that every row of is sparse. Let Zij = Ylk^j ^i,j,k and let Yij be the 
indicator that Z^j > 0. Recalling that = Pi,j,kj we have (by Assumption 3.1) EZjj < np. Since 

Zij takes non-negative integer values, we have Pr(l^j = 1) = Pr(Zjj > 0) < Kp. By Bernstein’s 
inequality, for any fixed i 


d 2 / 

PrdlM^^^llo > Kd2 pYt) < Pr(^ Yij > Kd 2 P + t) < exp ( 

j=i V 


f /2 


Kpd2 -|- t/3 


Integrating by parts, we have 


E 


||A/^*^||ol < Kd2P + 


poo 

/ PrdlM^I 

J Kd2P 


0 >t) dt < Kd 2 P + 


S' 
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Next, we will consider the size of the elements in First of all, < Zij (this fairly 

crude bound will lose us a factor of Y^log(did^). Now, Bernstein’s inequality applied to Zij gives 

>Kp + t)< Y>i{Zij > up+ t) < exp ■ 

Taking a union bound over i and j, if t > CK\og{did 2 ) then 

Pr(maxM^^.^^ > t) < did 2 ex.p {—ct) < exp(—c't). 

ij ■> 


Integrating by parts. 


E 


maxM, 


( 1 ) 






< /tlog' 


‘{did2) + f 

J K 


Klog-^(did2) 


Pr(maxMj*'-^^ > \/i) dt < Klog‘^{did 2 ) + C. 




Going back to (12), we have shown that 

Emax||M}J^|| < CKy/p^log{did2). 

i 

The same argument applies to (but with \/pdi instead of \/j ^), and so we conclude from 0 
that 

E||M«|| < CK^/p{di + d2)log{did2). 


C Proof of Theorem 3.2 


C.l A sketch of the proof 


The proof of Theorem 3.2 uses Fano’s inequality. 

1. We construct matrices X ^,..., X^. These matrices all have small nuclear norm, and for every 
pair i,j the KL-divergence between the induced observation distributions is 0(log.£). We 
construct these matrices randomly, using concentration inequalities and a union bound to 
show that we can take £ of the order \m{di + ^ 2 )- 

2. We apply Fano’s inequality to show that if we generate data according to a randomly chosen 
A*, then any algorithm has a reasonable chance to choose a different X^ (using the fact that 
the KL-divergence is 0{log£)). Since the KL-divergence is r2(logt'), this implies that the 
algorithm incurs a substantial penalty whenever it makes a wrong choice. 


In any application of Fano’s inequality, the key is to construct a large number of admissible 
models that are close to one another in KL-divergence. Specifically, if we can construct distributions 
Pi,...,P£ with T>(Pj||Pj) -|- 1 < ^log^ for all i,j, then given a single sample from some Pj, no 
algorithm can accurately identify which Pj it came from. In order to apply this denote by Px,m the 
distribution of the data when the true parameters are X. We will construct X^ , X^ G y/Xd^d^K 
such that for all i ^ j, 




Rj{X^) > Rj{X^) + c 


\ogi 
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m 


(13) 

(14) 











for some constant c > 0, where Rj denotes the expected risk when the true parameters are given 
by XK Given a single observation from some (13) will imply (by Fano’s inequality) that 


no algorithm can correctly identify which was the true parameter. On the other hand, (14) 
will imply that if the algorithm makes a mistake - say it chooses X^ for i ^ j ~ then its risk 
will be larger than the best in the class. In particular, if we can prove (13) and (14) with 
log.^ ~ Xm{di + d 2 ) then it will imply Theorem 3.2. 

We construct a set of matrices satisfying (13) and (14) using a probabilistic method. Supposing 
that d 2 > di, we choose a parameter 7 > 0 and set B to be an integer that is approximately 
We define X^ by filling its top B xd 2 block with independent, uniform ±7 entries, and then copying 
that top block B/di times to fill the matrix. Then let X^,... ,X^ be independent copies of X^. 
First of all, each X* G y/Xdid^K because ||-T*||* < Y^rank(X*)||X*||ir < ^/Xdid^- 

Now, let us consider D(¥x^,m\Wx'^,m)- For a single i,j,k triple, there is probability 1/4 of 
having Xl- — Xjj, different from Xf^ — Xfy., in which case they differ by 47 . If 7 is bounded above, 
each different entry contributes 0 (a^ 7 ^) to the KL-divergence between Pjchm Since 


about m entries are observed in see that 

D(J^X^,m\\^X‘2,m) 




(15) 


On the other hand, i?i(X^) and Ri{X‘^) differ by ©( 7 ^), because for a constant fraction of triples 
i,j,k, the chance that is 1 differs by 0 ( 7 ) in X^ and and on the event that Yij^k differs 
in these two models the loss differs by another 0 ( 7 ) factor. 

Applying standard concentration inequalities, we show that one can apply the union bound to 

and (15), we need to take Bd 2 = x 


i = exp(ci? 6 ^ 2 ) of these matrices. In view of 
Eliminating 7 , we end up with log£ x ^^^Xm/di (which is within a constant factor of \m{di + ^ 2 ) 
under our assumption that d 2 > di). 


C.2 Some concentration lemmas 

We begin by quoting some standard concentration results (see, e.g. [22] )• 

Definition C.l. A random variable X is a'^-subgaussian ifKe^^ < for all 0 > 0. A random 

variable X is L-subexponential ifKe^^ < (1 — 6 ‘^L^) for 9 < 1/L. 

One can easily show that the product of two subgaussian variables is subexponential: 

Lemma C.2. If X is -subgaussian andY is -subgaussian then XY is Car-subexponential for 
a universal eonstant C. 

Moreover, one has a Bernstein-type inequality for sums of independent subexponential variables. 
Lemma C.3. If Xi,..., Xf^ are i.i.d. L-subexponential then 

i ^ ' 
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C.3 Construction of a packing set 

Let 0 < 7 < 1 be some parameter to be determined such that B := A 7 “^ is an integer. 

Proposition C.4. Suppose that C\0) < 0. For every sufficiently small 7 (depending on C), there 
exists a set X C of exp{cBd 2 ) di x ^2 matrices such that for any two X^,X‘^ G X, 

A E E ^xA^YiXf^ - Xf,)) - £{Y{Xy - Xi))] > C 72 


and for any m, 


— D(IPjs:l,ml|IP’x2,m) < 
m 


where 0 < c < C are universal constants. 

Following Davenport et ah, we construct this set X randomly: let X be a random Bxd 2 matrix, 
where each element is chosen independently to be either 7 or — 7 . 

Lemma C.5. Let X^ and X‘^ be independent copies of X. Then with probability at least 1 — 
exp{-cBd2), 

E E - E +> ‘^i^Bdi 

i=l j^k=l 

where c> 0 is a universal constant. 


Before proving Lemma C.5 let us see how it implies Proposition C.4 First of all, for X a 
random B x d 2 matrix as above, let X be the di x d 2 matrix obtained by stacking \di/B~\ copies of 
X, and filling out any remaining entries by zeros. Then, for random X and Y, with high probability 


di d 2 B d 2 

E ^+xif = \di/B-] Y, E - E+ 

i=l j,k=l i=l j,k=l 

X 7^dld2, 


2 \2 
ik) 


(16) 


where the lower bound for the last line came from Lemma C.5 and the upper bound just came 
from the observation that each term in the sum is bounded by 167 ^. Let X be the set obtaine d by 
choosing exp{cBd2/4:) random copies of X in this way. The high-probability estimate in Lemma C.5 


implies that with high probability, every pair in X satisfies (16). Now 




m 

did"^ 


Y D{f{Xl^-Xl^)\\f{Xl-Xl)) 

{i,j,k)&n 

Ym-xh-xf^+xi 

i,j,k 


^2 \2 
'^ikJ 5 


where f{x) = e*/(l + e*) is the logistic function, and the last line follows from a Taylor expansion 
of D{f{x)\\f{y)) around x = y, because all the Xh and Xf^ are bounded by 7 < 1. Together 


with (16), this proves the first inequality in Proposition C.4; the second inequality follows because 
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each term of the form D{f{Xij — Xii^)\\f{Yij — Yik)} is bounded by a constant times 7^. This proves 


the second inequality of Proposition C.4 


C.5 


By Taylor expansion again, if 7 is sufficiently small (depending on C) then 

- Xl,)) - - Xl,)) X Y.j^kiXl^ -Xl,-Xlj+ Xl,). 

Now, if i, j, k is a triple for which 27 = X^j —X^f^ > Xfj — Xfj^ (and under the event of Lemma 
there are at least cBd^ such triples) then x 7 and so 

E^i [£(y*,,,fc(x2. - xl,)) - m^,,k{xl^ - xl,))] X 72. 

The same holds when i,j, k is a triple for which —27 = Xf^ — < Xfj — Xfj^. Finally, if i,j, k is 

a triple such that Xlj — Xlj^ = Xf^ — Xff^ then the expectation is zero. Summing over all triples, 
we see that on the event that Lemma C.5 holds, 

-xi,))-mj^kixy-xi,))] > 


After summing over all \di/B~\ blocks, this proves the hrst inequality of Proposition C.4 


Proof of Lemma C.5 . We expand the square: 

J2{X,j - Xik - Yij + Y,kf = 2 ^ Xfj + + 2X,jYik - X^^X^k - Y^jY^k - 2Xi,Y,, 


ijk 


ijk 


47^5^2 + 2 ^ 2XijYik — XijXik — YijYik — 2XijYij. (17) 

ijk 


We may study each of the cross-terms separately: for the XijYik term, note that Y^j Xij and Yk Xik 
are both 7^(i2-subgaussian (by Hoeffding’s inequality). Hence, YjkXijYk is C'7^d2-subexponential 


(by Lemma C.2) and so by Lemma C.3 

Pr I 

ijk 


> ^YBd\ I < 2ex.p{—cBd2). 
8 


The similar argument applies to the XijXi^ term: Yj Xij is 7^(i2-subgaussian and so Yijk XijXi^ = 
YiiYlj Xij)'^ is C'7^(i2-subexponential; hence 


Pr 


^ ^ XijXi]f 

ijk 


> ^Y^di 1 ^ 2 exp{—cBd2) ■ 


Of course, the YijYik term is identical. Finally, note that YijkXijYij = d 2 Yij XijYij. Since the 
terms in this sum are i.i.d., we may apply Hoeffding’s inequality to obtain 


Pr 


ijk 




> -YBdl I = Pr 


E-^'*x« 


1 


> LYBd 2 I < 2exp(— 


Putting everything together, we see that with high probability, the total of all the cross-terms 
in is at most half of the first term. □ 
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C.4 Completing the proof 


Let C denote the constant from Proposition C.4 Assume that di < d 2 and that m is large enough 
so 


m V “2 


(18) 


Note that under the assumptions A > 1 and m> di + d 2 from Theorem 3.2, the lower bound of (18) 


is satished. Moreover, if the upper bound of (18) is not satished then we may decrease A until it 


is; the conclusion of Theorem 3.2 will not be affected because as long as (18) fails, the minimum in 
Theorem 3.2 will be 1. 


By the lower bound in (18), there is an integer B such that 


„ Xm „ 


hx this B and define 7 by 


Y = XIB>,\ ^ 


Xd2 


m 


By the upper bound in (18), 7 < 1 


Now, Fano’s inequality states that if we first select a random A G A and then draw a sample 
from then any algorithm trying to identify X can succeed with probability at most 


mm{DiFx,mmy,m)) ■. X,Y € X} + 1 


log 1^1 


- < 


2C'm7^ ^ 1 
Bd2 2 


Finally, note that by the first inequality in Proposition C.4, the error incurred by choosing the 
wrong A G A is at least x \J 

Now, we have so far only discussed the case d 2 > di. The case di < d 2 is not exactly equivalent 
because our model is not symmetric in its treatment of users and items. However, the proof of 
Theorem 3.2 does not change very much. We take horizontally stacked blocks of size dix B instead 


of B X d 2 - The main difference is in the calculation leading to (16): there are extra cross-terms 


appearing due to the fact that items in different blocks need to be compared with one another. 
However, all of these additional terms may be controlled with Lemmas C.2| and C.3 in much the 
same way as the existing terms are controlled. 


D Comparison to Stochastic Gradient Descent 

Another practical algorithm to optimize (3) is Stochastic Gradient Descent (SGD). We have exper¬ 
imented SGD on the same datasets in Table 1. We ran the algorithm with the same regularization 
parameters and different step sizes. The statistical results for SGD were observed to be no better 
than AltSVM, and hence we did not present them in the main paper. 

Let us first describe the SGD procedure. At each step, ones chooses a triple G D 
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Datasets 

N 

NDCG@10 


20 

0.6852 

MLlm 

50 

0.7666 


100 

0.7728 


20 

0.6977 

MLlOm 

50 

0.7452 


100 

0.7659 


Table 5: NDCG@10 of SGD on different datasets, for different numbers of observed ratings per 
user. 


Precision® 

SGD with C = 5000 

1 

0.1556 

2 

0.1498 

5 

0.1236 

10 

0.1031 

100 

0.0441 


Table 6: Precision@/C for SGD of (3) on the binarized MovieLenslm dataset. 


uniformly at random and run a SGD step, which can be written as 


uf ^ Ui-rj ■ 
^ Vj -rj- 

^ Vj -7] ■ 




A 

-g-ui + 


|D^| 


Vk 


where denotes the number of comparisons in D which involve item j. r/ is a step size and 
g G dC{u](vj - Vk)). 

The following tables show the statistical result of SGD. The step size is chosen hj g = as 
suggested in [33]. a and /3 were the powers of 10“^, and the best result is reported. The results 
are comparable to AltSVM, but it did not achieve better results. We note that this is the best 
result from several different step sizes, while AltSVM does not have any other parameter to choose 
except for the regularization parameter. 
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